A Wikipedia-Based Multilingual Retrieval Model
Abstract
This paper introduces CL-ESA, a new multilingual retrieval model for the analysis of cross-language similarity. The retrieval model exploits the multilingual alignment of Wikipedia: given a document $d$ written in language $L$, we construct a concept vector $\mathbf{d}$ for $d$, where each dimension $i$ of $\mathbf{d}$ quantifies the similarity of $d$ to a document $d^{*}_{i}$ chosen from the “$L$-subset” of Wikipedia. Likewise, for a second document $d'$ written in language $L'$, $L \neq L'$, we construct a concept vector $\mathbf{d}'$, using from the $L'$-subset of Wikipedia the topic-aligned counterparts $d'^{*}_{i}$ of the previously chosen documents. Since the two concept vectors $\mathbf{d}$ and $\mathbf{d}'$ are collection-relative representations of $d$ and $d'$, they are language-independent; that is, their similarity can be computed directly, for instance with the cosine similarity measure. We present results of an extensive analysis that demonstrates the power of this new retrieval model: for a query document $d$, the topically most similar documents from a corpus in another language are properly ranked. A salient property of the new retrieval model is its robustness with respect to both the size and the quality of the index document collection.
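The construction described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the whitespace tokenizer, the raw term-frequency weighting, the toy index collections `INDEX_EN` and `INDEX_DE`, and all function names are assumptions made for illustration. A real index would consist of topic-aligned Wikipedia articles and would likely use a stronger term-weighting scheme.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Hypothetical tokenizer: lowercase, split on whitespace, count terms."""
    return Counter(text.lower().split())


def cosine_sparse(u, v):
    """Cosine similarity of two sparse term-frequency vectors (Counters)."""
    dot = sum(weight * v[term] for term, weight in u.items() if term in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def cosine_dense(a, b):
    """Cosine similarity of two equal-length dense vectors (lists)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def concept_vector(doc, index_docs):
    """Dimension i holds the similarity of doc to the i-th index document."""
    d = bag_of_words(doc)
    return [cosine_sparse(d, bag_of_words(index_doc)) for index_doc in index_docs]


def cl_esa_similarity(doc, doc_prime, index_l, index_l_prime):
    """Compare documents in different languages via their concept vectors.

    index_l[i] and index_l_prime[i] must describe the same topic.
    """
    if len(index_l) != len(index_l_prime):
        raise ValueError("index collections must be topic-aligned")
    return cosine_dense(
        concept_vector(doc, index_l),
        concept_vector(doc_prime, index_l_prime),
    )


# Toy stand-ins for topic-aligned Wikipedia articles (assumed data).
INDEX_EN = [
    "dog animal pet bark",
    "car engine road wheel",
    "music song guitar melody",
]
INDEX_DE = [
    "hund tier haustier bellen",
    "auto motor strasse rad",
    "musik lied gitarre melodie",
]

query_en = "my dog is a pet"
corpus_de = ["der hund ist ein haustier", "das auto hat einen motor"]

# Rank the German documents by their similarity to the English query.
ranking = sorted(
    corpus_de,
    key=lambda doc: cl_esa_similarity(query_en, doc, INDEX_EN, INDEX_DE),
    reverse=True,
)
```

In this toy setting the query and the first German document share no vocabulary, yet both load on the same index dimension, so the first document ranks above the second. The two documents are never compared term by term; only their concept vectors are compared.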
Similar resources
I2R At ImageCLEF Wikipedia Retrieval 2010
We report on our approaches and methods for the ImageCLEF 2010 Wikipedia image retrieval task. A distinctive feature of this year’s image collection is that images are associated with unstructured and noisy textual annotations in three languages: English, French and German. Hence, besides following conventional text-based and multimodal approaches, we also put some effort into investigating mu...
Dbnary: Wiktionary as a LMF based Multilingual RDF network
Contributive resources, such as Wikipedia, have proved to be valuable in Natural Language Processing or Multilingual Information Retrieval applications. This article focusses on Wiktionary, the dictionary part of the collaborative resources sponsored by the Wikimedia
Dbnary: Wiktionary as a Lemon Based RDF Multilingual Lexical Resource
Contributive resources, such as Wikipedia, have proved to be valuable to Natural Language Processing or multilingual Information Retrieval applications. This work focusses on Wiktionary, the dictionary part of the resources sponsored by the Wikimedia foundation. In this article, we present our effort to extract multilingual lexical data from Wiktionary data and to provide it to the community as...
UAIC's Participation at Wikipedia Retrieval @ ImageCLEF 2011
This paper describes the participation of the UAIC team in the ImageCLEF 2011 competition, Wikipedia Retrieval task. The aim of the task was to investigate retrieval approaches in the context of a large and heterogeneous collection of images and their noisy text annotations. We submitted a total of six runs, focusing our effort on textual retrieval, query expansion for the English language, combi...
DBnary: Wiktionary as a Lemon-based multilingual lexical resource in RDF
Contributive resources, such as Wikipedia, have proved to be valuable to Natural Language Processing or multilingual Information Retrieval applications. This work focusses on Wiktionary, the dictionary part of the resources sponsored by the Wikimedia foundation. In this article, we present our extraction of multilingual lexical data from Wiktionary data and provide it to the community as a M...
Publication date: 2008